The aim of this analysis is to investigate the effect of daily servings of the 5 main food groups (meat, fruit, veggies, grains, dairy) on measures of obesity (including, BMI, waist-to-height ratio [WtHR], and waist circumference). This will allow us to identify Which food groups are obesogenic, and which food groups are protective against obesity. This will be done with a linear regression model and selection by AIC/BIC.
We will then compare the effect of these food groups across the obesity measures, to see if they affect a certain obesity measure more than another. This will be done by comparing standardized beta coefficients.
Our secondary aim is to investigate the effect of overall diet composition on obesity. This will be done by conducting a PCA on the average daily servings of each meal and doing hierarchical clustering to identify natural patterns in the data. Then, the proportion of obese as defined by each of the obesity measures will be calculated for each cluster.
# selecting food groups
foods = tech_nutr %>% dplyr::select(ABSPID,
VEGLEG1N,
VEGLEG2N,
FRUIT1N,
FRUIT2N,
DAIRY1N,
DAIRY2N,
MEAT1N,
MEAT2N,
GRAINS1N,
GRAINS2N)
# getting average of food groups
avg_veges <- rowMeans(foods[ , c(2,3)], na.rm=TRUE)
avg_fruit <- rowMeans(foods[ , c(4,5)], na.rm=TRUE)
avg_dairy <- rowMeans(foods[ , c(6,7)], na.rm=TRUE)
avg_meat <- rowMeans(foods[ , c(8,9)], na.rm=TRUE)
avg_grains <- rowMeans(foods[ , c(10,11)], na.rm=TRUE)
dat <- cbind(tech_nutr$ABSPID,
avg_veges,
avg_fruit,
avg_dairy,
avg_meat,
avg_grains)
dat <- as_tibble(dat)
colnames(dat)[1] <- "ABSPID"
# convering average food servings to numeric
dat$avg_veges <- as.numeric(dat$avg_veges)
dat$avg_fruit <- as.numeric(dat$avg_fruit)
dat$avg_dairy <- as.numeric(dat$avg_dairy)
dat$avg_meat <- as.numeric(dat$avg_meat)
dat$avg_grains <- as.numeric(dat$avg_grains)
tech_biom1 = tech_biom %>% dplyr::select(c(1:53))
# merging average food servings with biomedical data
final = merge(dat, tech_biom1, by = "ABSPID")
final = final %>% dplyr::select(avg_veges,
avg_fruit,
avg_dairy,
avg_meat,
avg_grains,
BMISC,
AGEC,
SEX,
PHDCMHBC,
PHDCMWBC)
final = rename(final, BMI = BMISC)
final = rename(final, AGE = AGEC)
final = rename(final, HEIGHT = PHDCMHBC)
final = rename(final, WAIST_CIRCUMFERENCE = PHDCMWBC)
# calculating waist to height ratio
final$w2hratio = (final$WAIST_CIRCUMFERENCE)/(final$HEIGHT)
# to make results more robust, focus on adults only
final <- final %>% filter(AGE > 17)
# creating complete dataset by removing NAs.
final = final %>% na.omit()
The average daily servings of each of the 5 food groups was calculated using the tech_nutr dataset (across the two days) for each individual. Then, this average food data was linked to the biomedical data in tech_biom with the waist circumference, BMI, age, sex and height variables. Following this, the waist to height ratio was then calculated for each individual and included in the dataset as a new variable.
Finally, only obervations with Age \(\geq\) 18 were included (to include only adults in results) and observations with NA’s were omitted to give a complete dataset. The complete final dataset contained 7811 observations.
numeric_hist <- function(data, x) {
ggplot(data, aes_string(x = `x`)) +
geom_histogram(colour = "black", fill = "white")
}
p1 = numeric_hist(final, x = "avg_dairy")
p2 = numeric_hist(final, x = "avg_fruit")
p3 = numeric_hist(final, x = "avg_meat")
p4 = numeric_hist(final, x = "avg_grains")
p5 = numeric_hist(final, x = "avg_veges")
p6 = numeric_hist(final, x = "AGE")
gridExtra::grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2)
Figure 1: Distribution of predictor variables in dataset
All of the diet variables look fairly right skewed. Age on the other hand has a fairly uniform distribution, although it has some random peaks throughout.
### Linear regression with BMI as outcome
bmi_full <- lm(BMI ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX, dat = final)
bmi_null <- lm(BMI ~ 1, dat = final)
### linear regression with waist circumference as outcome
waist_full <- lm(WAIST_CIRCUMFERENCE ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX, dat = final)
waist_null <- lm(WAIST_CIRCUMFERENCE ~ 1, dat = final)
### Linear regression with waist to heigh ratio as outcome
w2hratio_full <- lm(w2hratio ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX, dat = final)
w2hratio_null <- lm(w2hratio ~ 1, dat = final)
### Robust regression with MM estimation
BMI_rlm <- MASS::rlm(BMI ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX, dat = final, method = "MM")
waist_rlm <- MASS::rlm(WAIST_CIRCUMFERENCE ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX, dat = final, method = "MM")
w2hratio_rlm <- MASS::rlm(w2hratio ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX, dat = final, method = "MM")
full_model_table <- tab_model(bmi_full,
waist_full,
w2hratio_full,
show.ci = F)
full_model_table
| Â | BMI | WAIST CIRCUMFERENCE | w 2 hratio | |||
|---|---|---|---|---|---|---|
| Predictors | Estimates | p | Estimates | p | Estimates | p |
| (Intercept) | 25.98 | <0.001 | 89.07 | <0.001 | 0.49 | <0.001 |
| avg veges | -0.12 | <0.001 | -0.29 | <0.001 | -0.00 | <0.001 |
| avg fruit | -0.20 | <0.001 | -0.57 | <0.001 | -0.00 | <0.001 |
| avg dairy | 0.10 | 0.141 | 0.18 | 0.255 | -0.00 | 0.275 |
| avg meat | 0.22 | <0.001 | 0.37 | 0.002 | 0.00 | 0.007 |
| avg grains | -0.17 | <0.001 | -0.38 | <0.001 | -0.00 | <0.001 |
| AGE | 0.05 | <0.001 | 0.23 | <0.001 | 0.00 | <0.001 |
| SEX [2] | -0.66 | <0.001 | -10.08 | <0.001 | -0.02 | <0.001 |
| Observations | 7811 | 7811 | 7811 | |||
| R2 / R2 adjusted | 0.041 / 0.040 | 0.190 / 0.190 | 0.155 / 0.154 | |||
In all full models, it can be seen that dairy is insignificant.
rlm_model_table <- tab_model(BMI_rlm,
waist_rlm,
w2hratio_rlm,
show.ci = F)
rlm_model_table
| Â | BMI | WAIST CIRCUMFERENCE | w 2 hratio | |||
|---|---|---|---|---|---|---|
| Predictors | Estimates | p | Estimates | p | Estimates | p |
| (Intercept) | 25.45 | <0.001 | 87.83 | <0.001 | 0.48 | <0.001 |
| avg veges | -0.11 | <0.001 | -0.28 | <0.001 | -0.00 | <0.001 |
| avg fruit | -0.17 | <0.001 | -0.56 | <0.001 | -0.00 | <0.001 |
| avg dairy | 0.09 | 0.126 | 0.22 | 0.167 | -0.00 | 0.358 |
| avg meat | 0.22 | <0.001 | 0.37 | 0.002 | 0.00 | 0.002 |
| avg grains | -0.19 | <0.001 | -0.41 | <0.001 | -0.00 | <0.001 |
| AGE | 0.06 | <0.001 | 0.25 | <0.001 | 0.00 | <0.001 |
| SEX [2] | -1.32 | <0.001 | -10.78 | <0.001 | -0.02 | <0.001 |
| Observations | 7811 | 7811 | 7811 | |||
# getting total observations in dataset
n = nrow(final)
# bmi_AIC is just full model
# bmi_AIC <- step(bmi_full, direction = "backward", trace = F)
bmi_BIC <- step(bmi_full, direction = "backward", trace = F, k = log(n))
# fwd AIC model same as back
# step(bmi_null, scope = list(lower = bmi_null, upper = bmi_full), direction = "forward", trace = F)
tab_model(bmi_full, bmi_BIC)
| Â | BMI | BMI | ||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 25.98 | 25.50 – 26.45 | <0.001 | 26.05 | 25.58 – 26.51 | <0.001 |
| avg veges | -0.12 | -0.19 – -0.06 | <0.001 | -0.12 | -0.19 – -0.06 | <0.001 |
| avg fruit | -0.20 | -0.29 – -0.11 | <0.001 | -0.19 | -0.28 – -0.10 | <0.001 |
| avg dairy | 0.10 | -0.03 – 0.23 | 0.141 | |||
| avg meat | 0.22 | 0.12 – 0.31 | <0.001 | 0.22 | 0.12 – 0.31 | <0.001 |
| avg grains | -0.17 | -0.23 – -0.12 | <0.001 | -0.16 | -0.22 – -0.11 | <0.001 |
| AGE | 0.05 | 0.05 – 0.06 | <0.001 | 0.05 | 0.05 – 0.06 | <0.001 |
| SEX [2] | -0.66 | -0.91 – -0.41 | <0.001 | -0.66 | -0.91 – -0.41 | <0.001 |
| Observations | 7811 | 7811 | ||||
| R2 / R2 adjusted | 0.041 / 0.040 | 0.041 / 0.040 | ||||
Table above shows the coefficients for the full model and the BIC backward selection BMI model.
### waist circumeference
waist_AIC <- step(waist_full, direction = "backward", trace = F)
# waist_BIC same as AIC
# waist_BIC <- step(waist_full, direction = "backward", trace = F, k = log(n))
# backward selection same as fwd
# step(waist_null, scope = list(lower = waist_null, upper = waist_full), direction = "forward", trace = F)
tab_model(waist_full, waist_AIC)
| Â | WAIST CIRCUMFERENCE | WAIST CIRCUMFERENCE | ||||
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 89.07 | 87.91 – 90.23 | <0.001 | 89.20 | 88.06 – 90.34 | <0.001 |
| avg veges | -0.29 | -0.44 – -0.13 | <0.001 | -0.28 | -0.44 – -0.13 | <0.001 |
| avg fruit | -0.57 | -0.79 – -0.35 | <0.001 | -0.56 | -0.78 – -0.34 | <0.001 |
| avg dairy | 0.18 | -0.13 – 0.50 | 0.255 | |||
| avg meat | 0.37 | 0.13 – 0.60 | 0.002 | 0.37 | 0.13 – 0.60 | 0.002 |
| avg grains | -0.38 | -0.51 – -0.24 | <0.001 | -0.36 | -0.49 – -0.22 | <0.001 |
| AGE | 0.23 | 0.22 – 0.25 | <0.001 | 0.23 | 0.22 – 0.25 | <0.001 |
| SEX [2] | -10.08 | -10.69 – -9.48 | <0.001 | -10.08 | -10.69 – -9.48 | <0.001 |
| Observations | 7811 | 7811 | ||||
| R2 / R2 adjusted | 0.190 / 0.190 | 0.190 / 0.190 | ||||
### w2h ratio
w2hratio_AIC <- step(w2hratio_full, direction = "backward", trace = F)
w2hratio_BIC <- step(w2hratio_full, direction = "backward", trace = F, k = log(n))
# backward selection same as fwd
# step(w2hratio_null, scope = list(lower = w2hratio_null, upper = waist_full), direction = "forward", trace = F)
tab_model(w2hratio_full, w2hratio_AIC, w2hratio_BIC)
| Â | w 2 hratio | w 2 hratio | w 2 hratio | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 0.49 | 0.48 – 0.49 | <0.001 | 0.49 | 0.48 – 0.49 | <0.001 | 0.49 | 0.48 – 0.50 | <0.001 |
| avg veges | -0.00 | -0.00 – -0.00 | <0.001 | -0.00 | -0.00 – -0.00 | <0.001 | -0.00 | -0.00 – -0.00 | <0.001 |
| avg fruit | -0.00 | -0.01 – -0.00 | <0.001 | -0.00 | -0.01 – -0.00 | <0.001 | -0.00 | -0.01 – -0.00 | <0.001 |
| avg dairy | -0.00 | -0.00 – 0.00 | 0.275 | ||||||
| avg meat | 0.00 | 0.00 – 0.00 | 0.007 | 0.00 | 0.00 – 0.00 | 0.007 | |||
| avg grains | -0.00 | -0.00 – -0.00 | <0.001 | -0.00 | -0.00 – -0.00 | <0.001 | -0.00 | -0.00 – -0.00 | <0.001 |
| AGE | 0.00 | 0.00 – 0.00 | <0.001 | 0.00 | 0.00 – 0.00 | <0.001 | 0.00 | 0.00 – 0.00 | <0.001 |
| SEX [2] | -0.02 | -0.02 – -0.01 | <0.001 | -0.02 | -0.02 – -0.01 | <0.001 | -0.02 | -0.02 – -0.01 | <0.001 |
| Observations | 7811 | 7811 | 7811 | ||||||
| R2 / R2 adjusted | 0.155 / 0.154 | 0.155 / 0.154 | 0.154 / 0.153 | ||||||
Table above shows the coefficients for the full model and the AIC backward selection waist circumference model.
params = trainControl(method = "cv", number = 10, verboseIter = FALSE)
bmi_full <- lm(BMI ~ avg_veges + avg_fruit + avg_dairy + avg_meat +
avg_grains + AGE + SEX,
method = "lm",
data = final,
trControl = params)
summary(bmi_full)
##
## Call:
## lm(formula = BMI ~ avg_veges + avg_fruit + avg_dairy + avg_meat +
## avg_grains + AGE + SEX, data = final, method = "lm", trControl = params)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.835 -3.793 -0.854 2.789 35.088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.975387 0.243339 106.746 < 2e-16 ***
## avg_veges -0.124598 0.032734 -3.806 0.000142 ***
## avg_fruit -0.198827 0.046527 -4.273 1.95e-05 ***
## avg_dairy 0.097887 0.066491 1.472 0.141009
## avg_meat 0.216046 0.049438 4.370 1.26e-05 ***
## avg_grains -0.174983 0.028362 -6.170 7.19e-10 ***
## AGE 0.053893 0.003592 15.005 < 2e-16 ***
## SEX2 -0.663969 0.127662 -5.201 2.03e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.445 on 7803 degrees of freedom
## Multiple R-squared: 0.04127, Adjusted R-squared: 0.04041
## F-statistic: 47.99 on 7 and 7803 DF, p-value: < 2.2e-16
model.diag.metrics <- augment(bmi_full)
head(model.diag.metrics)
## # A tibble: 6 × 15
## .rownames BMI avg_veges avg_fruit avg_dairy avg_meat avg_grains AGE SEX
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <fct>
## 1 1 20.9 3.62 0.412 2.51 4.96 6.11 46 2
## 2 2 29.8 2.95 2.03 1.44 2.19 3.45 77 1
## 3 3 24.4 9.08 1.34 1.87 3.04 2.45 55 2
## 4 4 21.4 5.87 5.06 3.64 2.79 2.35 59 2
## 5 6 25.1 4.54 0.154 1.59 1.56 1.67 35 2
## 6 8 30.7 0.706 0 1.77 1.59 3.56 35 1
## # … with 6 more variables: .fitted <dbl>, .resid <dbl>, .hat <dbl>,
## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
plot1 <- ggplot(model.diag.metrics, aes(avg_meat, BMI)) +
geom_point() +
stat_smooth(method = lm, se = FALSE) +
geom_segment(aes(xend = avg_meat, yend = .fitted), color = "red", size = 0.3)
plot2 <- ggplot(model.diag.metrics, aes(avg_dairy, BMI)) +
geom_point() +
stat_smooth(method = lm, se = FALSE) +
geom_segment(aes(xend = avg_dairy, yend = .fitted), color = "red", size = 0.3)
plot3 <- ggplot(model.diag.metrics, aes(avg_grains, BMI)) +
geom_point() +
stat_smooth(method = lm, se = FALSE) +
geom_segment(aes(xend = avg_grains, yend = .fitted), color = "red", size = 0.3)
plot4 <- ggplot(model.diag.metrics, aes(avg_veges, BMI)) +
geom_point() +
stat_smooth(method = lm, se = FALSE) +
geom_segment(aes(xend = avg_veges, yend = .fitted), color = "red", size = 0.3)
grid.arrange(plot1, plot2, plot3, plot4, ncol=2)
par(mfrow=c(2,2))
plot(bmi_full)
Residuals vs. fitted is relatively linear, indicating a linear relationship.
Normal Q-Q plot, standardised residuals follows closely the normal line, indicating normality of residuals.
Scale-location has no obvious non-linear pattern, indicating homoscedasticity.
In the residuals vs. leverage graph a small number of points are outside of Cook’s distance, indicating they are outliers. Due to the relatively small proportion of outliers, we make the assumption that there are no significant outlier effects.
vis.obj_bmi <- vis(bmi_full, B = 150, redundant = TRUE, nbest = "all",
seed = 2021)
vis.obj_waist <- vis(waist_full, B = 150, redundant = TRUE, nbest = "all",
seed = 2021)
vis.obj_w2hratio <- vis(w2hratio_full, B = 150, redundant = TRUE, nbest = "all",
seed = 2021)
plot(vis.obj_bmi,
interactive = FALSE,
highlight = "avg_dairy",
which = "boot")
Figure 2: Model stability plot of BMI model
plot(vis.obj_bmi,
interactive = FALSE,
highlight = "avg_dairy",
which = "vip")
Figure 3: Variable inclusion plot of BMI model
Shows that dairy is close to the Redundant variable (RV) curve, hence dairy does not add any significant information to the model.
plot(vis.obj_waist,
interactive = FALSE,
highlight = "avg_dairy",
which = "boot")
Figure 4: Model stability plot of Waist Circumference model
plot(vis.obj_waist,
interactive = FALSE,
highlight = "avg_dairy",
which = "vip")
Figure 5: VIP of Waist Circumference model
plot(vis.obj_w2hratio,
interactive = FALSE,
highlight = "avg_dairy",
which = "boot")
Figure 6: Model stability plot of WtHR model
plot(vis.obj_w2hratio,
interactive = FALSE,
highlight = "avg_dairy",
which = "vip")
Figure 7: VIP of WtHR model
Shows that dairy is close to the Redundant variable (RV) curve, hence dairy does not add any significant information to the model.
From all of these above plots, we can see that models containing avg_dairy are only dominant when there are a large number of parameters in the model (i.e. close to the full model). From this, we can see that avg_dairy is not particularly stable, and therefore suggests it should not be included in a final model.
An ANOVA was conducted of the full model against the respective AIC/BIC models for each obesity measure. The ANOVA can be used to identify whether the simpler (backward selection) model should be retained or the more complex full model should be used instead. As each of the BIC/AIC models are nested within the full models, the ANOVA can be used.
anova(bmi_BIC, bmi_full) %>% round(2) %>%
kbl(caption = "ANOVA table for BIC and full BMI models") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 7804 | 231389.3 | NA | NA | NA | NA |
| 7803 | 231325.0 | 1 | 64.25 | 2.17 | 0.14 |
anova(waist_AIC, waist_full) %>% round(2) %>%
kbl(caption = "ANOVA table for AIC and full waist circumference models") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 7804 | 1366547 | NA | NA | NA | NA |
| 7803 | 1366320 | 1 | 227.03 | 1.3 | 0.25 |
anova(w2hratio_BIC, w2hratio_AIC, w2hratio_full) %>% round(2) %>%
kbl(caption = "ANOVA table for BIC, AIC and full WtHR models") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 7805 | 49.33 | NA | NA | NA | NA |
| 7804 | 49.28 | 1 | 0.05 | 7.20 | 0.01 |
| 7803 | 49.27 | 1 | 0.01 | 1.19 | 0.27 |
params = trainControl(method = "cv", number = 10, verboseIter = FALSE)
set.seed(2021)
cv_objects = list(
bmi_full = train(BMI ~ avg_veges + avg_fruit + avg_dairy + avg_meat +
avg_grains + AGE + SEX,
method = "lm",
data = final,
trControl = params),
bmi_BIC = train(BMI ~ avg_veges + avg_fruit + avg_meat + avg_grains +
AGE + SEX,
method = "lm",
data = final,
trControl = params),
bmi_tree = train(BMI ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX,
data = final,
method = 'rpart',
tuneLength = 15,
trControl = params),
waist_full = train(WAIST_CIRCUMFERENCE ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX,
method = "lm",
data = final,
trControl = params),
waist_AIC = train(WAIST_CIRCUMFERENCE ~ avg_veges + avg_fruit + avg_meat + avg_grains + AGE + SEX,
method = "lm",
data = final,
trControl = params),
waist_tree = train(WAIST_CIRCUMFERENCE ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX,
data = final,
method = 'rpart',
tuneLength = 15,
trControl = params),
w2hratio_full = train(w2hratio ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX,
method = "lm",
data = final,
trControl = params),
w2hratio_AIC = train(w2hratio ~ avg_veges + avg_fruit + avg_meat + avg_grains + AGE + SEX,
method = "lm",
data = final,
trControl = params),
w2hratio_BIC = train(w2hratio ~ avg_veges + avg_fruit + avg_grains +
AGE + SEX,
method = "lm",
data = final,
trControl = params),
w2hratio_tree = train(w2hratio ~ avg_veges + avg_fruit + avg_dairy + avg_meat + avg_grains + AGE + SEX,
data = final,
method = 'rpart',
tuneLength = 15,
trControl = params)
)
cv_results_bmi = resamples(cv_objects[1:3], metric = "RMSE")
ggplot(cv_results_bmi) +
theme_bw() +
labs(x = "Models", y = "RMSE", title = "10-Fold CV Performance")
cv_results_waist = resamples(cv_objects[4:6], metric = "RMSE")
ggplot(cv_results_waist) +
theme_bw() +
labs(x = "Models", y = "RMSE", title = "10-Fold CV Performance")
cv_results_w2hratio = resamples(cv_objects[7:10], metric = "RMSE")
ggplot(cv_results_w2hratio) +
theme_bw() +
labs(x = "Models", y = "RMSE", title = "10-Fold CV Performance")
10 Fold cross validation was done on all models using the caret package in R. A decision tree was also included in the cross validation to compare with the out of sample performance of the linear regression models.
Standardized beta coefficients were calculated for each of the final models, (BMI BIC model, AIC waist circumference, AIC WtHR).
# creating scaled df
sex <- final$SEX
scaled_df <- final %>%
dplyr::select(-c(SEX)) %>%
scale() %>% as.data.frame()
scaled_df$SEX <- sex
# lm scaled to get standardized betas from final models to compare
bmi_scaled <- lm(BMI ~ avg_veges + avg_fruit + avg_meat + avg_grains + AGE + SEX, dat = scaled_df)
w2h_scaled <- lm(w2hratio ~ avg_veges + avg_fruit + avg_meat + avg_grains + AGE + SEX, dat = scaled_df)
waist_scaled <- lm(WAIST_CIRCUMFERENCE ~ avg_veges + avg_fruit + avg_meat + avg_grains + AGE + SEX, dat = scaled_df)
st_beta_table <- tab_model(bmi_scaled, w2h_scaled, waist_scaled)
st_beta_table
| Â | BMI | w 2 hratio | WAIST CIRCUMFERENCE | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 0.06 | 0.03 – 0.09 | <0.001 | 0.09 | 0.06 – 0.12 | <0.001 | 0.36 | 0.33 – 0.39 | <0.001 |
| avg veges | -0.05 | -0.07 – -0.02 | <0.001 | -0.05 | -0.08 – -0.03 | <0.001 | -0.04 | -0.06 – -0.02 | <0.001 |
| avg fruit | -0.05 | -0.07 – -0.03 | <0.001 | -0.07 | -0.09 – -0.05 | <0.001 | -0.05 | -0.07 – -0.03 | <0.001 |
| avg meat | 0.05 | 0.03 – 0.08 | <0.001 | 0.03 | 0.01 – 0.05 | 0.007 | 0.03 | 0.01 – 0.06 | 0.002 |
| avg grains | -0.07 | -0.10 – -0.05 | <0.001 | -0.06 | -0.08 – -0.04 | <0.001 | -0.06 | -0.08 – -0.04 | <0.001 |
| AGE | 0.17 | 0.15 – 0.19 | <0.001 | 0.38 | 0.36 – 0.40 | <0.001 | 0.28 | 0.26 – 0.30 | <0.001 |
| SEX [2] | -0.12 | -0.16 – -0.07 | <0.001 | -0.18 | -0.22 – -0.14 | <0.001 | -0.69 | -0.73 – -0.64 | <0.001 |
| Observations | 7811 | 7811 | 7811 | ||||||
| R2 / R2 adjusted | 0.041 / 0.040 | 0.155 / 0.154 | 0.190 / 0.190 | ||||||
food_groups <- final %>% dplyr::select(avg_dairy,
avg_fruit,
avg_grains,
avg_meat,
avg_veges)
numerics <- final %>% dplyr::select(-SEX)
# principal component analysis
res.pca <- prcomp(
food_groups,
center = T,
scale = T
)
# extracting pc1 and pc2
pcs = res.pca$x[, c(1,2)] %>% as.data.frame()
## keeping original data
pcs1 <- pcs
## Creating k-means clustering model
fit_cluster_kmeans_pca <- kmeans(scale(pcs), 5)
# Assigning the result to the data used to create the tsne
pcs1$cl_kmeans <- factor(fit_cluster_kmeans_pca$cluster)
# Creating hierarchical cluster model
fit_cluster_hierarchical_pca <- hclust(dist(scale(pcs)), method = "ward.D2")
plot(fit_cluster_hierarchical_pca, hang = -1)
# Assigning the result to the data used
pcs1$cl_hierarchical <- factor(cutree(fit_cluster_hierarchical_pca, k=5))
pcs1$BMI <- final$BMI
pcs1$obese <- pcs1$BMI >= 30
cluster_ggplot <- pcs1 %>%
ggplot(aes(x = PC1,
y = PC2))+
geom_point(aes(color=obese),
alpha = 0.5)+
scale_color_manual(values = c("#964B00", "#8F00FF", "#0000FF", "#00FF00",
"#FFA500", "#808080", "#FF0000")) +
geom_mark_ellipse(aes(color = cl_hierarchical,
label=cl_hierarchical),
expand = unit(0.5,"mm"),
label.buffer = unit(-5, 'mm')) +
theme(legend.position = "none")
prop_obese <- pcs1 %>% group_by(cl_hierarchical) %>%
summarise(prop_obese = sum(as.numeric(obese))/n()) %>% kbl()
Table above shows how many standard deviations away from 0 each cluster is in each variable.
# obesity measured via BMI >= 30
pcs1$BMI <- final$BMI
pcs1$obese_BMI <- pcs1$BMI >= 30
# obesity measured via waist circumference
final$obese_waist_circum = NA
for (i in 1:nrow(final)) {
if (final$SEX[i] == "2") {
if (final$WAIST_CIRCUMFERENCE[i] > 88) {
final$obese_waist_circum[i] = 1
} else {
final$obese_waist_circum[i] = 0
}
} else if (final$SEX[i] == "1") {
if (final$WAIST_CIRCUMFERENCE[i] > 102) {
final$obese_waist_circum[i] = 1
} else {
final$obese_waist_circum[i] = 0
}
}
}
pcs1$obese_waist_circum <- final$obese_waist_circum
# obesity measured using w2hratio
final$obese_w2hratio = NA
for (i in 1:nrow(final)) {
if (final$SEX[i] == "2") {
if (final$w2hratio[i] >= 0.58) {
final$obese_w2hratio[i] = 1
} else {
final$obese_w2hratio[i] = 0
}
} else if (final$SEX[i] == "1") {
if (final$w2hratio[i] >= 0.63) {
final$obese_w2hratio[i] = 1
} else {
final$obese_w2hratio[i] = 0
}
}
}
pcs1$obese_w2hratio <- final$obese_w2hratio
data <- scale(pcs)
k.max <- 15
wss <- sapply(1:k.max,
function(k){kmeans(data, k, nstart=50,iter.max = 15 )$tot.withinss})
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares", main="k-means elbow")
#get dendrogram
plot(fit_cluster_hierarchical_pca, hang = -1)
pcs1 %>%
ggplot(aes(x = PC1,
y = PC2))+
geom_point(aes(color=obese_BMI),
alpha = 0.5)+
scale_color_manual(values = c("#964B00", "#8F00FF", "#0000FF", "#00FF00",
"#FFA500", "#808080", "#FF0000")) +
geom_mark_ellipse(aes(color = cl_hierarchical,
label=cl_hierarchical),
expand = unit(0.5,"mm"),
label.buffer = unit(-5, 'mm')) +
theme(legend.position = "none") + ggtitle("Hierarchical, BMI")
pcs1 %>%
ggplot(aes(x = PC1,
y = PC2))+
geom_point(aes(color=factor(as.character(obese_waist_circum))),
alpha = 0.5)+
scale_color_manual(values = c("#964B00", "#8F00FF", "#0000FF", "#00FF00",
"#FFA500", "#808080", "#FF0000")) +
geom_mark_ellipse(aes(color = cl_hierarchical,
label=cl_hierarchical),
expand = unit(0.5,"mm"),
label.buffer = unit(-5, 'mm')) +
theme(legend.position = "none") + ggtitle("Hierarchical, waist circumference")
pcs1 %>%
ggplot(aes(x = PC1,
y = PC2))+
geom_point(aes(color=obese_w2hratio %>% as.character() %>% as.factor()),
alpha = 0.5)+
scale_color_manual(values = c("#964B00", "#8F00FF", "#0000FF", "#00FF00",
"#FFA500", "#808080", "#FF0000")) +
geom_mark_ellipse(aes(color = cl_hierarchical,
label=cl_hierarchical),
expand = unit(0.5,"mm"),
label.buffer = unit(-5, 'mm')) +
theme(legend.position = "none") + ggtitle("Hierarchical, waist to height ratio")
pcs1 %>% group_by(cl_hierarchical) %>%
summarise(prop_obese_BMI = (sum(as.numeric(obese_BMI))/n()) %>% round(2)) %>%
kbl(caption = "Proportion obese in each cluster according to BMI") %>%
kable_classic(full_width = F, html_font = "Cambria")
| cl_hierarchical | prop_obese_BMI |
|---|---|
| 1 | 0.28 |
| 2 | 0.26 |
| 3 | 0.21 |
| 4 | 0.30 |
| 5 | 0.25 |
pcs1 %>% group_by(cl_hierarchical) %>%
summarise(prop_obese_waist_circum = (sum(as.numeric(obese_waist_circum))/n()) %>% round(2)) %>%
kbl(caption = "Proportion obese in each cluster according to BMI") %>%
kable_classic(full_width = F, html_font = "Cambria")
| cl_hierarchical | prop_obese_waist_circum |
|---|---|
| 1 | 0.40 |
| 2 | 0.39 |
| 3 | 0.32 |
| 4 | 0.45 |
| 5 | 0.34 |
pcs1 %>% group_by(cl_hierarchical) %>%
summarise(prop_obese_w2hratio = (sum(as.numeric(obese_w2hratio))/n()) %>% round(2)) %>%
kbl(caption = "Proportion obese in each cluster according to BMI") %>%
kable_classic(full_width = F, html_font = "Cambria")
| cl_hierarchical | prop_obese_w2hratio |
|---|---|
| 1 | 0.24 |
| 2 | 0.26 |
| 3 | 0.17 |
| 4 | 0.30 |
| 5 | 0.21 |
scaled_numerics <- scale(numerics)
scaled_numerics <- as.data.frame(scaled_numerics)
scaled_numerics$cl_hierarchical <- pcs1$cl_hierarchical
clusters_characteristics <- scaled_numerics %>% group_by(cl_hierarchical) %>%
summarise(avg_veges = mean(avg_veges) %>% round(2),
avg_dairy = mean(avg_dairy) %>% round(2),
avg_fruit = mean(avg_fruit) %>% round(2),
avg_meat = mean(avg_meat) %>% round(2),
avg_grains = mean(avg_grains) %>% round(2),
avg_age = mean(AGE) %>% round(2),
avg_bmi = mean(BMI) %>% round(2),
avg_waist = mean(WAIST_CIRCUMFERENCE) %>% round(2),
avg_w2hratio = mean(w2hratio) %>% round(2))
clusters_characteristics %>%
kbl(caption = "Cluster characteristics from hierarchical clustering. Values show the number of standard deviations for each cluster away from overall mean of foodgroup in the dataset.") %>%
kable_classic(full_width = F, html_font = "Cambria")
| cl_hierarchical | avg_veges | avg_dairy | avg_fruit | avg_meat | avg_grains | avg_age | avg_bmi | avg_waist | avg_w2hratio |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.94 | -0.30 | 0.15 | 1.02 | 0.06 | 0.08 | 0.01 | 0.06 | 0.02 |
| 2 | -0.27 | 0.55 | 0.05 | -0.33 | 0.26 | -0.01 | -0.02 | -0.03 | -0.03 |
| 3 | 0.55 | 0.88 | 0.93 | 0.48 | 1.10 | -0.02 | -0.13 | -0.07 | -0.17 |
| 4 | -0.56 | -0.55 | -0.42 | -0.55 | -0.57 | -0.02 | 0.04 | 0.00 | 0.06 |
| 5 | -0.11 | 2.55 | 0.69 | -0.22 | 1.36 | -0.21 | -0.04 | 0.02 | -0.12 |
fit_cluster_kmeans_pca$centers
## PC1 PC2
## 1 0.1290055 -0.7673895
## 2 -0.2666472 0.6721143
## 3 -1.3463290 -1.3397889
## 4 -1.6525479 1.6419573
## 5 0.9946403 0.1433501
fit_cluster_kmeans_pca$withinss
## [1] 759.0425 1015.6125 1194.7680 848.8755 812.0312
We can see that the clusters have reasonably similar variance
pcs1 %>%
ggplot(aes(x = PC1,
y = PC2))+
geom_point(aes(color=obese_BMI),
alpha = 0.5)+
scale_color_manual(values = c("#964B00", "#8F00FF", "#0000FF", "#00FF00",
"#FFA500", "#808080", "#FF0000")) +
geom_mark_ellipse(aes(color = cl_kmeans,
label=cl_kmeans),
expand = unit(0.5,"mm"),
label.buffer = unit(-5, 'mm')) +
theme(legend.position = "none") + ggtitle("K-means, BMI")
pcs1 %>%
ggplot(aes(x = PC1,
y = PC2))+
geom_point(aes(color=factor(as.character(obese_waist_circum))),
alpha = 0.5)+
scale_color_manual(values = c("#964B00", "#8F00FF", "#0000FF", "#00FF00",
"#FFA500", "#808080", "#FF0000")) +
geom_mark_ellipse(aes(color = cl_kmeans,
label=cl_kmeans),
expand = unit(0.5,"mm"),
label.buffer = unit(-5, 'mm')) +
theme(legend.position = "none") + ggtitle("K-means, waist circumference")
pcs1 %>%
ggplot(aes(x = PC1,
y = PC2))+
geom_point(aes(color=obese_w2hratio %>% as.character() %>% as.factor()),
alpha = 0.5)+
scale_color_manual(values = c("#964B00", "#8F00FF", "#0000FF", "#00FF00",
"#FFA500", "#808080", "#FF0000")) +
geom_mark_ellipse(aes(color = cl_kmeans,
label=cl_kmeans),
expand = unit(0.5,"mm"),
label.buffer = unit(-5, 'mm')) +
theme(legend.position = "none") + ggtitle("K-means, waist to height ratio")
pcs1 %>% group_by(cl_kmeans) %>%
summarise(prop_obese_BMI = (sum(as.numeric(obese_BMI))/n()) %>% round(2)) %>%
kbl(caption = "Proportion obese in each cluster according to BMI with kmeans") %>%
kable_classic(full_width = F, html_font = "Cambria")
| cl_kmeans | prop_obese_BMI |
|---|---|
| 1 | 0.27 |
| 2 | 0.28 |
| 3 | 0.23 |
| 4 | 0.25 |
| 5 | 0.30 |
pcs1 %>% group_by(cl_kmeans) %>%
summarise(prop_obese_waist_circum = (sum(as.numeric(obese_waist_circum))/n()) %>% round(2)) %>%
kbl(caption = "Proroption obese in each cluster according to waist circumference with kmeans") %>%
kable_classic(full_width = F, html_font = "Cambria")
| cl_kmeans | prop_obese_waist_circum |
|---|---|
| 1 | 0.41 |
| 2 | 0.41 |
| 3 | 0.32 |
| 4 | 0.37 |
| 5 | 0.45 |
pcs1 %>% group_by(cl_kmeans) %>%
summarise(prop_obese_w2hratio = (sum(as.numeric(obese_w2hratio))/n()) %>% round(2)) %>%
kbl(caption = "Proroption obese in each cluster according to WtHR with kmeans") %>%
kable_classic(full_width = F, html_font = "Cambria")
| cl_kmeans | prop_obese_w2hratio |
|---|---|
| 1 | 0.28 |
| 2 | 0.25 |
| 3 | 0.18 |
| 4 | 0.20 |
| 5 | 0.30 |
scaled_numerics <- scale(numerics)
scaled_numerics <- as.data.frame(scaled_numerics)
scaled_numerics$cl_kmeans <- pcs1$cl_kmeans
scaled_numerics %>% group_by(cl_kmeans) %>%
summarise(avg_veges = mean(avg_veges) %>% round(2),
avg_dairy = mean(avg_dairy) %>% round(2),
avg_fruit = mean(avg_fruit) %>% round(2),
avg_meat = mean(avg_meat) %>% round(2),
avg_grains = mean(avg_grains) %>% round(2),
avg_age = mean(AGE) %>% round(2),
avg_bmi = mean(BMI) %>% round(2),
avg_waist = mean(WAIST_CIRCUMFERENCE) %>% round(2),
avg_w2hratio = mean(w2hratio) %>% round(2)) %>%
kbl(caption = "Characteristics of Each kmeans cluster") %>%
kable_classic(full_width = F, html_font = "Cambria")
| cl_kmeans | avg_veges | avg_dairy | avg_fruit | avg_meat | avg_grains | avg_age | avg_bmi | avg_waist | avg_w2hratio |
|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.41 | 0.46 | -0.03 | -0.46 | 0.14 | 0.00 | 0.01 | -0.03 | 0.00 |
| 2 | 0.46 | -0.29 | 0.05 | 0.53 | -0.02 | 0.08 | 0.01 | 0.03 | 0.03 |
| 3 | 0.28 | 1.59 | 0.95 | 0.20 | 1.28 | -0.11 | -0.12 | -0.04 | -0.18 |
| 4 | 1.92 | -0.06 | 0.53 | 1.89 | 0.56 | 0.01 | -0.03 | 0.06 | -0.06 |
| 5 | -0.61 | -0.68 | -0.50 | -0.59 | -0.69 | -0.02 | 0.04 | -0.01 | 0.06 |
centers <- fit_cluster_kmeans_pca$centers[fit_cluster_kmeans_pca$cluster, ] # "centers" is a data frame of 3 centers but the length of iris dataset so we can canlculate distance difference easily.
distances <- sqrt(rowSums((pcs - centers)^2))
#outliers in the cluster
outliers <- order(distances, decreasing=T)[1:10]
outliers_pcs <- pcs[outliers,]
iqr_pc1 <- IQR(pcs[,"PC1"])
iqr_pc2 <- IQR(pcs[,"PC2"])
mean_pc1 <- mean(pcs[,"PC1"])
mean_pc2 <- mean(pcs[,"PC2"])
pc1_outliers = (pcs$PC1 - mean_pc1) > 1.5*iqr_pc1
pc2_outliers = (pcs$PC2 - mean_pc2) > 1.5*iqr_pc2
sum(pc1_outliers)/nrow(pcs)
## [1] 0.0008961721
sum(pc2_outliers)/nrow(pcs)
## [1] 0.03917552
food_groups <- final %>% dplyr::select(avg_dairy,
avg_fruit,
avg_grains,
avg_meat,
avg_veges)
pca <- preProcess(x=food_groups, method="pca", pcaComp=2)
pca_svm <- predict(pca, final)
pca_svm$obese_bmi <- pcs1$obese_BMI
pca_svm$obese_waist_circum <- pcs1$obese_waist_circum
pca_svm$obese_w2hratio <- pcs1$obese_w2hratio
svm_w2hratio <- svm(factor(obese_w2hratio) ~., data = pca_svm, kernel="polynomial")
svm_circumf <- svm(factor(obese_waist_circum) ~ ., data=pca_svm, kernel="polynomial")
svm_bmi <- svm(factor(obese_bmi) ~., data=pca_svm, kernel="polynomial")
summary(svm_w2hratio)
##
## Call:
## svm(formula = factor(obese_w2hratio) ~ ., data = pca_svm, kernel = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 857
##
## ( 434 423 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
summary(svm_circumf)
##
## Call:
## svm(formula = factor(obese_waist_circum) ~ ., data = pca_svm, kernel = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 1641
##
## ( 820 821 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
summary(svm_bmi)
##
## Call:
## svm(formula = factor(obese_bmi) ~ ., data = pca_svm, kernel = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## coef.0: 0
##
## Number of Support Vectors: 1137
##
## ( 576 561 )
##
##
## Number of Classes: 2
##
## Levels:
## FALSE TRUE
plot(svm_w2hratio, data=pca_svm, PC2 ~ PC1)
plot(svm_circumf, data=pca_svm, PC2 ~ PC1)
plot(svm_bmi, data=pca_svm, PC2 ~ PC1)